1 Guided analysis

1.1 Initial data handling

After reading the gapminder_clean.csv file using read_csv we have the following dataset that will be used for all the analysis in this document:

library(tidyverse)
library(ggplot2)
library(plotly)
library(countrycode)
library(kableExtra)
library(car)
library(DT)

data <- read.csv("gapminder_clean.csv") %>%
  as_tibble() %>%
  rename(co2_emissions = CO2.emissions..metric.tons.per.capita.) %>%
  rename(population_density = Population.density..people.per.sq..km.of.land.area.) %>%
  rename(imports = Imports.of.goods.and.services....of.GDP.) %>%
  rename(energy_use = Energy.use..kg.of.oil.equivalent.per.capita.) %>%
  rename(life_expectancy = Life.expectancy.at.birth..total..years.)

datatable(data, options = list(pageLength = 10, scrollX = "400px"), filter = "top")

1.2 CO2 and GDP per capita

First off, it’s requested that I plot a graph of countries’ CO2 emissions matched with their GDP per capita in 1962. To achieve that, we must filter the data so we get just the data on the relevant year and then we can already plot the requested graph:

data_on_62 <- data %>%
  filter(Year == 1962)

ggplot(data_on_62, aes(x = log10(co2_emissions), y = log10(gdpPercap))) +
  geom_point() +
  labs(
    x = "log10(CO2 emissions (metric tons per capita))", y = "log10(GDP per capita)",
    title = "GDP per capita variation according to CO2 emissions"
  )

And also determine the correlation between these variables.

cor_test_res <- format(cor.test(data_on_62$co2_emissions, data_on_62$gdpPercap)$p.value, digits=3)

With a p-value of 1.13e-46, we can conclude that there is a statistically significant correlation between countries GDP per capita and their CO2 emissions.

It is then requested that I calculate in which year the correlation between the CO2 emissions and GDP per capita is stronger. The correlations are shown in the following table in decrescent order:

cor_table <- data %>%
  select(Country.Name, Year, gdpPercap, co2_emissions) %>%
  drop_na() %>%
  group_by(Year) %>%
  summarize(cor = cor(co2_emissions, gdpPercap))

cor_table <- cbind(Rank = seq(1, 10), cor_table[order(-cor_table$cor), ])

cor_table %>%
  kbl(caption = "Correlation between the CO2 emissions and GDP per capita in its respective years") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Correlation between the CO2 emissions and GDP per capita in its respective years
Rank Year cor
1 1967 0.9387918
2 1962 0.9260817
3 1972 0.8428986
4 1982 0.8166384
5 1987 0.8095531
6 1992 0.8094316
7 1997 0.8081396
8 2002 0.8006421
9 1977 0.7928336
10 2007 0.7204169

And finally, we will plot the same dot plot as before, but coloring the dots according to their continents and sizing them according to their population:

co2_gdp_scatterplot <- data_on_62 %>%
  select(Country.Name, Year, co2_emissions, continent, pop, gdpPercap) %>%
  drop_na() %>%
  ggplot(aes(
    x = log10(co2_emissions),
    y = log10(gdpPercap),
    color = continent,
    size = pop
  )) +
  geom_point(alpha = 0.5) +
  labs(
    x = "CO2 emissions (metric tons per capita)", y = "GDP per capita",
    title = "GDP per capita variation according to CO2 emissions",
  ) +
  scale_color_discrete(name = "Continent") +
  scale_size("", range = c(1, 10))
ggplotly(co2_gdp_scatterplot)

2 Unguided analysis

2.1 Energy use problem

What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’?

The relation we want to analyse it’s relative to a categorical predictor variable and a single quantitative outcome to be evaluated in respect to multiple groups. To determine the adequate statistical test to be implemented in this case, we should resolve whether to use a parametric or non-parametric method. We can do this by evaluating if the variance between groups is equal by applying Levene’s test and if the data is normally distributed by a Shapiro-Wilk test.

levene_res <- format(leveneTest(energy_use ~ continent, data = data)$Pr[1], digits = 3)

shapiro_res <- format(shapiro.test(data$energy_use)$p.value, digits = 3)

So, with a p-value of 4.25e-09 on Levene’s test and 5.06e-48 on Shapiro-Wilk, we can reject both the null hypothesis and conclude that the variances are not equal and that the data is not distributed normally. Taking that into account, the adequate statistical test to be applied is the Kruskal-Wallis test.

kruskal_res <- format(kruskal.test(energy_use ~ continent, data = data)$p.value, digits = 3)

As the p-value found is as small as 3.79e-71, we can rule out the null hypothesis and conclude there’s a statistically significant correlation between continent and per capita energy use.

2.2 Imports problem

Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990?

In contrast, although this case also features a categorical predictor variable and single quantitative outcome, they are grouped according to just two continents. To evaluate which statistical test to use, we must again determine whether the distribution of the data is normal or not using the Shapiro-Wilk test.

relevant_continents <- c("Europe", "Asia")

data_as_eu_after_90 <- data %>%
  select(Country.Name, Year, imports, continent) %>%
  filter(continent %in% relevant_continents, Year > 1990)

shapiro_europe <- format(shapiro.test(data_as_eu_after_90$imports[data_as_eu_after_90$continent == "Europe"])$p.value, digits = 3)
shapiro_asia <- format(shapiro.test(data_as_eu_after_90$imports[data_as_eu_after_90$continent == "Asia"])$p.value, digits = 3)

As the european and asian p-values were respectively 1.36e-05 and 2.31e-08, we can conclude that they are not normally distributed. Therefore, the appropriate statistical test to be applied is Mann-Whitney-Wilcoxon test.

europe_asia_imports_mww <- format(wilcox.test(data_as_eu_after_90$imports[data_as_eu_after_90$continent == "Europe"], data_as_eu_after_90$imports[data_as_eu_after_90$continent == "Asia"])$p.value, digits=3)

As the p-value is 0.787 we cannot rule out the null hypothesis, which means there is no statistically significant difference between Asian and European imports of services and goods after 1990.

2.3 Population density problem

What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years?

We can visualize the question in this problem by plotting a line graph showing the population density evolution across the years:

pop_density_plot <- ggplot(data, aes(x = Year, y = log10(population_density), group = Country.Name, color = Country.Name, label = Country.Name)) +
  geom_line() +
  labs(
    y = "Population density (log10(people/square km of land area))", x = "Year",
    title = "Population density variation according to year per country",
  )
ggplotly(pop_density_plot)

To ascertain which country has the highest average ranking, we can compare them according to a score that consists of the sum of their position in a decreasing ranking of population density spanning all available years divided by the amount of valid datapoints for that country. This score is presented in the following table:

years <- unique(data$Year)
countries <- unique(data$Country.Name[!is.na(data$Country.Name)])

country_iterator <- function(country, year_data) {
  year_data <- year_data %>%
    filter(Country.Name == country)
  if (nrow(year_data)>0) {
    return(year_data$population_density)
  } else {
    return(NA)
  }
}

year_round_calculus <- function(year) {
  year_data <- data %>%
    select(Country.Name, Year, population_density) %>%
    filter(Year == year)
  vector <- sapply(countries, country_iterator, year_data)
  vector <- rank(vector, na.last = 'keep')
  return(vector)
}

valid_datapoints_calculator <- function(country) {
  country_data <- data %>%
    filter(Country.Name == country) %>%
    select(Country.Name, Year, population_density) %>%
    na.omit()
  valid_datapoints <- length(country_data$Year)
  return(valid_datapoints)
}

density_score <- rowSums(sapply(years, year_round_calculus))
valid_datapoints <- sapply(countries, valid_datapoints_calculator)

pop_density_ranking <- (density_score / valid_datapoints) %>%
  as_tibble() %>%
  mutate(country=countries) %>%
  filter(!is.infinite(value)) %>%
  arrange(desc(value))

max_dens <- pop_density_ranking %>%
  na.omit() %>%
  filter(value == max(value)) %>%
  pull(country)

data.frame(Country = pop_density_ranking$country, Score = pop_density_ranking$value) %>%
  datatable()

We can conclude that the locations with the highest average ranking in this period are Macao SAR, China, Monaco.

Having already calculated the population density score, we can best visualize it by building a choropleth:

country_codes <- countrycode(pop_density_ranking$country, origin = "country.name", destination = "iso3c")
pop_dens_df <- data.frame(country = pop_density_ranking$country, rank = pop_density_ranking$value, code = country_codes)

geo_config <- list(
  scope = "world",
  showocean = TRUE,
  oceancolor = toRGB("LightBlue"),
  showland = TRUE,
  landcolor = toRGB("#e5ecf6")
)

density_choropleth <- plot_ly(pop_dens_df, type = "choropleth", locations = pop_dens_df$code, z = pop_dens_df$rank, text = pop_dens_df$country, colors = "Reds") %>%
  layout(title = "Population density ranking dominance in the 1962-2007 period)", geo = geo_config) %>%
  colorbar(title = "Arbitrary units")

density_choropleth

2.4 Life expectancy problem

What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ since 1962?

To calculate this, we must get the difference between the life expectancy at the first data point and the last one for each country. This is shown in the next table, with the presentation of the absolute change of life expectancy in years and the percentage of growth this change represents:

life_expectancy_diff_calculator <- function(country, country_life_exp) {
  country_life_exp <- data %>%
    select(Country.Name, Year, life_expectancy) %>%
    filter(Country.Name == country)
  years <- unique(country_life_exp$Year)
  life_expectancy_years <- round(country_life_exp$life_expectancy[country_life_exp$Year == tail(years, 1)] - country_life_exp$life_expectancy[country_life_exp$Year == head(years, 1)], digits = 2)
  life_expectancy_pchange <- round(((country_life_exp$life_expectancy[country_life_exp$Year == tail(years, 1)] / country_life_exp$life_expectancy[country_life_exp$Year == head(years, 1)]) - 1) * 100, digits = 2)
  return(c(life_expectancy_years, life_expectancy_pchange))
}

life_exp_df <- sapply(countries, life_expectancy_diff_calculator)

life_expectancy_diff <- as.data.frame(countries) %>%
  mutate(years = sapply(life_exp_df[1,], as.double)) %>%
  mutate(change = life_exp_df[2,])

max_le_diff <- life_expectancy_diff %>%
  na.omit() %>%
  filter(years==max(years))

max_le_pchange <- life_expectancy_diff %>%
  na.omit() %>%
  filter(change==max(change))

life_expectancy_diff %>%
  filter(!is.na(years)) %>%
  arrange(desc(years)) %>%
  datatable(colnames = c("Country", "Life expectancy change in years", "Percentage of growth in life expectancy"))

So we can conclude that the location with the greatest absolute increase in life expectancy were the Maldives, with an increase of 36.92 years. The country with the highest relative increase was Bhutan, that saw a growth of 100.32% in its life expectancy.

And again, to better visualize the data, we will plot a cloropleth:

le_country_codes <- countrycode(as.vector(countries), origin = "country.name", destination = "iso3c")
life_expectancy_df <- data.frame(country = life_expectancy_diff$countries, years = life_expectancy_diff$years, code = le_country_codes)

geo_config <- list(
  scope = "world",
  showocean = TRUE,
  oceancolor = toRGB("LightBlue"),
  showland = TRUE,
  landcolor = toRGB("#e5ecf6")
)

le_choropleth <- plot_ly(life_expectancy_df, type = "choropleth", locations = life_expectancy_df$code, z = life_expectancy_df$years, text = life_expectancy_df$country, colors = "RdYlGn") %>%
  layout(title = "Changes in life expectancy in the 1962-2007 period)", geo = geo_config) %>%
  colorbar(title = "Years")

le_choropleth